Screen and transform the data to make them more suitable for structure and parameter learning.
# discretize continuous data into factors.
discretize(data, method, breaks = 3, ordered = FALSE, ..., debug = FALSE)
# screen continuous data for highly correlated pairs of variables.
dedup(data, threshold, debug = FALSE)
discretize() returns a data frame with the same structure (number of columns, column names, etc.) as data, containing the discretized variables.
dedup() returns a data frame with a subset of the columns of data.
data: a data frame containing numeric columns (for dedup()) or a combination of numeric and factor columns (for discretize()).
threshold: a numeric value between zero and one, the absolute correlation used as a threshold in screening highly correlated pairs.
method: a character string, either interval for interval discretization, quantile for quantile discretization (the default) or hartemink for Hartemink's pairwise mutual information method.
breaks: an integer number, the number of levels the variables will be discretized into; or a vector of integer numbers, one for each column of the data set, specifying the number of levels for each variable.
ordered: a boolean value. If TRUE the discretized variables are returned as ordered factors instead of unordered ones.
...: additional tuning parameters, see below.
debug: a boolean value. If TRUE a lot of debugging output is printed; otherwise the function is completely silent.
Marco Scutari
discretize() takes a data frame as its first argument and returns a second data frame of discrete variables, transformed using one of three methods: interval, quantile or hartemink. Discrete variables are left unchanged.
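To make the difference between the interval and quantile methods concrete, the sketch below reproduces both ideas with base R's cut() and quantile(). It only illustrates the concepts and is not the code that discretize() runs internally; the variable names are made up for the example.

```r
# Illustration only: base R equivalents of the two marginal methods.
set.seed(42)
x <- rexp(200)  # a skewed continuous variable

# interval discretization: 3 bins of equal width over the range of x.
by_interval <- cut(x, breaks = 3)

# quantile discretization: 3 bins holding roughly the same number of points.
cuts <- quantile(x, probs = seq(0, 1, length.out = 4))
by_quantile <- cut(x, breaks = cuts, include.lowest = TRUE)

table(by_interval)  # counts are uneven because x is skewed
table(by_quantile)  # counts are close to 200 / 3 each
```

With skewed data, equal-width intervals leave most observations in the first level, while quantile-based levels stay balanced.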
The hartemink method has two additional tuning parameters:
idisc: the method used for the initial marginal discretization of the variables, either interval or quantile.
ibreaks: the number of levels the variables are initially discretized into, in the same format as in the breaks argument.
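As a usage sketch, the two tuning parameters can be passed through the ... argument of discretize(). The values chosen here (20 initial levels, 3 final levels) are arbitrary and only meant to show the call; gaussian.test is the data set shipped with bnlearn.

```r
library(bnlearn)
data(gaussian.test)

# Hartemink's method: start from 20 quantile-based levels per variable,
# then reduce them to the 3 levels requested by breaks.
d <- discretize(gaussian.test, method = "hartemink", breaks = 3,
                ibreaks = 20, idisc = "quantile")

sapply(d, nlevels)  # number of levels of each discretized variable
```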
It is sometimes the case that the quantile method cannot discretize one or more variables in the data without generating zero-length intervals because the quantiles are not unique. If method = "quantile", discretize() will produce an error. If method = "hartemink" and idisc = "quantile", discretize() will try to lower the number of breaks set by the ibreaks argument until the quantiles are distinct. If this is not possible without making ibreaks smaller than breaks, discretize() will produce an error.
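The problem of non-unique quantiles is easy to reproduce in base R with a variable that has many tied values. The sketch below is an illustration of the underlying issue, not of how discretize() detects it.

```r
# Illustration in base R: a variable with many tied values.
x <- c(rep(0, 8), 1, 2)

# boundaries for 4 quantile-based levels: the lower ones all collapse onto 0.
cuts <- quantile(x, probs = seq(0, 1, length.out = 5))
anyDuplicated(cuts) > 0  # TRUE: the quantiles are not unique

# cut() refuses duplicated breaks, so the error message is captured instead.
res <- tryCatch(cut(x, breaks = cuts, include.lowest = TRUE),
                error = function(e) conditionMessage(e))
res
```

Asking for fewer levels reduces the number of boundaries, which is why lowering the number of breaks can make the quantiles distinct again.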
dedup() screens the data for pairs of highly correlated variables, and discards one variable from each pair.
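The general idea of correlation screening can be sketched in a few lines of base R. Note that drop_correlated() is a hypothetical helper written for this illustration, not part of bnlearn, and that dropping the later column of each pair is an assumption; dedup() may choose differently.

```r
# Hypothetical helper: keep a column only if its absolute correlation with
# every previously kept column does not exceed the threshold.
drop_correlated <- function(data, threshold) {
  rho <- abs(cor(data))
  keep <- rep(TRUE, ncol(data))
  for (j in seq_len(ncol(data))[-1]) {
    # compare column j only with earlier columns that are still kept.
    earlier <- which(keep[seq_len(j - 1)])
    if (any(rho[earlier, j] > threshold))
      keep[j] <- FALSE
  }
  data[, keep, drop = FALSE]
}

set.seed(1)
a <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.01), c = rnorm(100))
names(drop_correlated(toy, threshold = 0.95))
```

In the toy data, b is almost a copy of a, so only one of the two survives the screening, while the unrelated column c is kept.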
Both discretize() and dedup() accept data with missing values.
Hartemink A (2001). Principled Computational Methods for the Validation and Discovery of Genetic Regulatory Networks. Ph.D. thesis, School of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
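library(bnlearn)
data(gaussian.test)
# discretize into 4 levels with Hartemink's method, starting from 10 initial levels.
d = discretize(gaussian.test, method = 'hartemink', breaks = 4, ibreaks = 10)
# learn a network structure from the discretized data with hill-climbing and plot it.
plot(hc(d))
# screen the continuous data for highly correlated pairs of variables.
d2 = dedup(gaussian.test)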